15 research outputs found

    A Multi-view Context-aware Approach to Android Malware Detection and Malicious Code Localization

    Full text link
    Existing Android malware detection approaches use a variety of features such as security sensitive APIs, system calls, control-flow structures and information flows in conjunction with Machine Learning classifiers to achieve accurate detection. Each of these feature sets provides a unique semantic perspective (or view) of apps' behaviours with inherent strengths and limitations. Meaning, some views are more amenable to detect certain attacks but may not be suitable to characterise several other attacks. Most of the existing malware detection approaches use only one (or a selected few) of the aforementioned feature sets which prevent them from detecting a vast majority of attacks. Addressing this limitation, we propose MKLDroid, a unified framework that systematically integrates multiple views of apps for performing comprehensive malware detection and malicious code localisation. The rationale is that, while a malware app can disguise itself in some views, disguising in every view while maintaining malicious intent will be much harder. MKLDroid uses a graph kernel to capture structural and contextual information from apps' dependency graphs and identify malice code patterns in each view. Subsequently, it employs Multiple Kernel Learning (MKL) to find a weighted combination of the views which yields the best detection accuracy. Besides multi-view learning, MKLDroid's unique and salient trait is its ability to locate fine-grained malice code portions in dependency graphs (e.g., methods/classes). Through our large-scale experiments on several datasets (incl. wild apps), we demonstrate that MKLDroid outperforms three state-of-the-art techniques consistently, in terms of accuracy while maintaining comparable efficiency. In our malicious code localisation experiments on a dataset of repackaged malware, MKLDroid was able to identify all the malice classes with 94% average recall

    Scalable malware clustering through coarse-grained behavior modeling

    Get PDF
    Anti-malware vendors receive several thousand new malware (malicious software) variants per day. Due to large volume of malware samples, it has become extremely important to group them based on their malicious characteristics. Grouping of malware variants that exhibit similar behavior helps to generate malware signatures more efficiently. Unfortunately, exponential growth of new malware variants and huge-dimensional feature space, as used in existing approaches, make the clustering task very challenging and difficult to scale. Furthermore, malware behavior modeling techniques proposed in the literature do not scale well, where malware feature space grows in proportion with the number of samples under examination. In this paper, we propose a scalable malware behavior modeling technique that models the interactions between malware and sensitive system resources in a coarse-grained manner. Coarse-grained behavior modeling enables us to generate malware feature space that does not grow in proportion with the number of samples under examination. A preliminary study shows that our approach generates 289 times less malware features and yet improves the average clustering accuracy by 6.20% comparing to a state-of-the-art malware clustering technique

    Scalable analysis for malware and vulnerability detection in binaries

    No full text
    In recent years, malware (malicious software) has greatly evolved and has become very sophisticated. The evolution and the volume of malware make it difficult to analyze them in an effective and scalable manner. It is reported that, on average, more than 100,000 fresh malware binaries emerge each day. Despite the common use and widespread availability of various anti-malware tools, proliferation of malware poses a major security threat to computer systems. Identification of malware variants significantly improves the detection rate and reduces the size of malware signature databases. Thus, it has become crucial for security researchers to analyze and cluster malware samples based on their malicious behavior. Though manual analysis of malware may provide a better understanding of its malicious behavior, it is practically impossible due to the exponential growth in the number of new malware each day. As a result, security researchers are constantly playing the catch-up game with the malware authors. On the other hand, it is understood that software vulnerabilities are paving the way for malicious programs to find their targets, where attackers exploit the newly discovered vulnerabilities in the software systems to deliver their malicious payload. It is reported that more than 10,000 new vulnerabilities were reported in the first half of 2017. However, it is almost impossible to avoid vulnerabilities at the development stage (i.e., source code level); it is even difficult to discover vulnerabilities at the production stage (i.e., machine code level). Further, there are still lots of software applications in the wild with known security vulnerabilities that are unpatched even when the official patches are available. In most cases, attackers leverage on exploits written for such unpatched vulnerabilities to compromise the targets and achieve their malicious intent. Identifying the unpatched vulnerabilities and alerting the interested parties in a timely manner helps to significantly reduce the malware proliferation rate. In addition to detecting unpatched vulnerabilities, it is also important to identify potential vulnerable locations (or attack surfaces) within the software application that can turn into a zero-day (or 0-day) vulnerability in the future, which can be later exploited or weaponized by the malware authors to deliver malicious payloads. The first part of this thesis is dedicated to addressing the scalability issues in the existing malware behaviour modelling techniques. In this work, we propose and evaluate a bounded feature space behaviour modelling technique, named eBOFM, for effective and scalable malware detection and triaging tasks. The malware features extracted using eBOFM are very expressive in nature and provide valuable information to security analysts. Most importantly, the eBOFM feature space is bounded by an upper limit. Whereas, in existing malware behaviour modelling techniques, either the feature space grows in proportion to the size of the execution traces (e.g., n-gram models) or high computational power and memory is required (e.g., graph-based models), which makes them impractical to use in real-world malware detectors and triaging tools. We also perform large-scale experiments showing that, with eBOFM, computation time and memory usage are vastly lower than currently reported results in existing malware detection and clustering techniques, while preserving or even improving their high detection and clustering accuracy. The second part of this thesis is dedicated to addressing the challenges in identifying the security patches at machine code (or binary) level. Understanding of vulnerabilities and their corresponding patches at binary level allows us to generate better vulnerability signatures that are later used to detect unpatched vulnerabilities in the commercial off-the-shelf (COTS) binaries. In this work, we propose a scalable binary-level patch analysis framework, named SPAIN, which can automatically identify security patches and summarize patch patterns and their corresponding vulnerability patterns. The summarized patterns can be used to search similar patches or vulnerabilities in binary programs. Our experimental results on several real-world projects have shown that SPAIN can identify security patches with high accuracy and scalability. Using SPAIN, we have summarized the vulnerability and the corresponding patch patterns for five major class of vulnerabilities including use-after-free, double free, integer under(over) flow, null pointer dereference and buffer overflow. The third part of this thesis is dedicated to addressing the challenges in performing cross-architecture cross-platform binary matching. Architecture and platform agnostic binary matching is a vital component in detecting unpatched vulnerabilities, where vulnerability patterns are obtained using patch analysis (in our case, SPAIN) and they are matched against the binaries in the wild. Different from source code clone detection, clone detection (similar code search) in binary executables faces big challenges due to the significant differences in the syntax and the structure of binary code that result from different configurations of compilers, architectures and OSs. To address these challenges, in this work, we propose BINGO - a scalable and robust binary search engine supporting various architectures and operating systems. The key contributions include binary emulation and selective function inlining technique to capture the complete program semantics. Further, architecture and OS neutral function filtering is proposed to improve the scalability, where it removes the irrelevant functions from the matching process. Besides, we also introduce variant length partial traces to model binary functions in a program structure agnostic fashion. The experimental results show that BINGO can find semantic similar functions across architecture and OS boundaries, even with the presence of program structure distortion, in a scalable manner. Using BINGO, we also discovered a zero-day vulnerability in Adobe PDF Reader, a COTS binary. The final part of this thesis is dedicated to identifying potential attack surfaces within the software applications, where identifying potentially vulnerable locations in a code base is critical a pre-step for effective vulnerability assessment; i.e., it can greatly help security experts to put their time and effort where it is needed the most. In this work, we propose and implement a generic, lightweight and extensible framework, named Leopard, to identify attack surfaces at the~function level through program metrics. Leopard does not require any prior knowledge about known vulnerabilities. It works in two steps by combining two sets of systematically derived metrics. First, it uses complexity metrics to group the functions in a target application into a set of bins. Then, it leverages vulnerability metrics to rank the functions in each bin and identifies the top ones as potentially vulnerable. Experimental results on ten real-life projects have demonstrated that Leopard can cover 66% of vulnerabilities by identifying 20% of functions as vulnerable.Doctor of Philosoph

    Accurate and scalable cross-architecture cross-OS binary code search with emulation

    No full text
    Different from source code clone detection, clone detection (similar code search) in binary executables faces big challenges due to the gigantic differences in the syntax and the structure of binary code that result from different configurations of compilers, architectures and OSs. Existing studies have proposed different categories of features for detecting binary code clones, including CFG structures, n-gram in CFG, input/output values, etc. In our previous study and the tool BinGo, to mitigate the huge gaps in CFG structures due to different compilation scenarios, we propose a selective inlining technique to capture the complete function semantics by inlining relevant library and user-defined functions. However, only features of input/output values are considered in BinGo. In this study, we propose to incorporate features from different categories (e.g., structural features and high-level semantic features) for accuracy improvement and emulation for efficiency improvement. We empirically compare our tool, BinGo-E, with the pervious tool BinGo and the available state-of-the-art tools of binary code search in terms of search accuracy and performance. Results show that BinGo-E achieves significantly better accuracies than BinGo for cross-architecture matching, cross-OS matching, cross-compiler matching and intra-compiler matching. Additionally, in the new task of matching binaries of forked projects, BinGo-E also exhibits a better accuracy than the existing benchmark tool. Meanwhile, BinGo-E takes less time than BinGo during the process of matching.NRF (Natl Research Foundation, S’pore
    corecore